This lesson covers packages primarily by Hadley Wickham for tidying data and then working with it in tidy form.
The packages we are using in this lesson are all from CRAN, so we can install them with install.packages.
install.packages(c(
# Hadley Wickham packages
"readr", # read tabular data
"tidyr", # data frame tidying functions
"dplyr", # general data frame manipulation
"ggplot2", # flexible plotting
# Viridis color scale
"viridis",
# (Advanced!) Tidy linear modelling
"broom"
))
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(viridis)
library(broom)
These packages usually have useful documentation in the form of “vignettes”. These are readable on the CRAN website, or within R:
vignette()
vignette(package="dplyr")
vignette("introduction", package="dplyr")
Let’s continue our examination of the FastQC output. If you’re starting fresh for this lesson, you can load the necessary data frame with:
bigtab <- read_csv("r-more-files/fastqc.csv")
bigtab$grade <- factor(bigtab$grade, c("FAIL","WARN","PASS"))
bigtab
## # A tibble: 72 x 3
## grade test file
## <fctr> <chr> <chr>
## 1 PASS Basic Statistics Day0.fastq
## 2 PASS Per base sequence quality Day0.fastq
## 3 PASS Per tile sequence quality Day0.fastq
## 4 PASS Per sequence quality scores Day0.fastq
## 5 FAIL Per base sequence content Day0.fastq
## 6 WARN Per sequence GC content Day0.fastq
## 7 PASS Per base N content Day0.fastq
## 8 PASS Sequence Length Distribution Day0.fastq
## 9 PASS Sequence Duplication Levels Day0.fastq
## 10 WARN Overrepresented sequences Day0.fastq
## # ... with 62 more rows
bigtab is a “tibble”, which is Hadley Wickham’s slightly improved data frame. One convenient feature is that it only shows a few rows of the data frame when you print it. If you do want to see the whole table, you can use as.data.frame, or use the View function to view it in a tab.
as.data.frame(bigtab)
View(bigtab)
dplyr gives us a collection of convenient “verbs” for manipulating data frames.
Say we want to know all the tests that failed. filter provides a convenient way to get these.
filter(bigtab, grade == "FAIL")
## # A tibble: 8 x 3
## grade test file
## <fctr> <chr> <chr>
## 1 FAIL Per base sequence content Day0.fastq
## 2 FAIL Per base sequence content Day4.fastq
## 3 FAIL Per base sequence content Day7.fastq
## 4 FAIL Per sequence GC content Day7.fastq
## 5 FAIL Per base sequence content Day10.fastq
## 6 FAIL Per base sequence content Day15.fastq
## 7 FAIL Per base sequence content Day20.fastq
## 8 FAIL Per sequence GC content Day20.fastq
Something is magic here: we do not have grade in our environment, only within the data frame.
Rather than filtering, we might instead want to sort the data so the most important rows are at the top. arrange sorts a data frame by one or more columns.
arrange(bigtab, grade)
## # A tibble: 72 x 3
## grade test file
## <fctr> <chr> <chr>
## 1 FAIL Per base sequence content Day0.fastq
## 2 FAIL Per base sequence content Day4.fastq
## 3 FAIL Per base sequence content Day7.fastq
## 4 FAIL Per sequence GC content Day7.fastq
## 5 FAIL Per base sequence content Day10.fastq
## 6 FAIL Per base sequence content Day15.fastq
## 7 FAIL Per base sequence content Day20.fastq
## 8 FAIL Per sequence GC content Day20.fastq
## 9 WARN Per sequence GC content Day0.fastq
## 10 WARN Overrepresented sequences Day0.fastq
## # ... with 62 more rows
# desc( ) can be used to reverse the sort order
arrange(bigtab, desc(grade))
## # A tibble: 72 x 3
## grade test file
## <fctr> <chr> <chr>
## 1 PASS Basic Statistics Day0.fastq
## 2 PASS Per base sequence quality Day0.fastq
## 3 PASS Per tile sequence quality Day0.fastq
## 4 PASS Per sequence quality scores Day0.fastq
## 5 PASS Per base N content Day0.fastq
## 6 PASS Sequence Length Distribution Day0.fastq
## 7 PASS Sequence Duplication Levels Day0.fastq
## 8 PASS Adapter Content Day0.fastq
## 9 PASS Kmer Content Day0.fastq
## 10 PASS Basic Statistics Day4.fastq
## # ... with 62 more rows
dplyr’s select function is for subsetting, renaming, and reordering columns. Old columns can be referred to by name or by number.
select(bigtab, test,grade)
## # A tibble: 72 x 2
## test grade
## <chr> <fctr>
## 1 Basic Statistics PASS
## 2 Per base sequence quality PASS
## 3 Per tile sequence quality PASS
## 4 Per sequence quality scores PASS
## 5 Per base sequence content FAIL
## 6 Per sequence GC content WARN
## 7 Per base N content PASS
## 8 Sequence Length Distribution PASS
## 9 Sequence Duplication Levels PASS
## 10 Overrepresented sequences WARN
## # ... with 62 more rows
select(bigtab, 2,1)
## # A tibble: 72 x 2
## test grade
## <chr> <fctr>
## 1 Basic Statistics PASS
## 2 Per base sequence quality PASS
## 3 Per tile sequence quality PASS
## 4 Per sequence quality scores PASS
## 5 Per base sequence content FAIL
## 6 Per sequence GC content WARN
## 7 Per base N content PASS
## 8 Sequence Length Distribution PASS
## 9 Sequence Duplication Levels PASS
## 10 Overrepresented sequences WARN
## # ... with 62 more rows
select(bigtab, foo=file, bar=test, baz=grade)
## # A tibble: 72 x 3
## foo bar baz
## <chr> <chr> <fctr>
## 1 Day0.fastq Basic Statistics PASS
## 2 Day0.fastq Per base sequence quality PASS
## 3 Day0.fastq Per tile sequence quality PASS
## 4 Day0.fastq Per sequence quality scores PASS
## 5 Day0.fastq Per base sequence content FAIL
## 6 Day0.fastq Per sequence GC content WARN
## 7 Day0.fastq Per base N content PASS
## 8 Day0.fastq Sequence Length Distribution PASS
## 9 Day0.fastq Sequence Duplication Levels PASS
## 10 Day0.fastq Overrepresented sequences WARN
## # ... with 62 more rows
You can also remove specific columns like this:
select(bigtab, -file)
## # A tibble: 72 x 2
## grade test
## <fctr> <chr>
## 1 PASS Basic Statistics
## 2 PASS Per base sequence quality
## 3 PASS Per tile sequence quality
## 4 PASS Per sequence quality scores
## 5 FAIL Per base sequence content
## 6 WARN Per sequence GC content
## 7 PASS Per base N content
## 8 PASS Sequence Length Distribution
## 9 PASS Sequence Duplication Levels
## 10 WARN Overrepresented sequences
## # ... with 62 more rows
Say we want to convert PASS/WARN/FAIL into a numeric score, so we can produce some numerical summaries. The scoring scheme will be:
fwp <- c("FAIL","WARN","PASS")
scoring <- tibble(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))
# Or:
# scoring <- data.frame(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))
scoring
## # A tibble: 3 x 2
## grade score
## <fctr> <dbl>
## 1 FAIL 0.0
## 2 WARN 0.5
## 3 PASS 1.0
In the introduction to R day, we saw the merge function. dplyr has its own version of this in the form of several functions: left_join, right_join, inner_join, full_join.
The difference between these functions is what happens when there is a row in one data frame without a corresponding row in the other data frame. inner_join discards such rows. full_join always keeps them, filling in missing data with NA. left_join always keeps rows from the first data frame. right_join always keeps rows from the second data frame.
Often we use a join to augment a data frame with some extra information. left_join is a good default choice for this as it will never delete rows from the data frame that is being augmented.
scoretab <- left_join(bigtab, scoring, by="grade")
scoretab
## # A tibble: 72 x 4
## grade test file score
## <fctr> <chr> <chr> <dbl>
## 1 PASS Basic Statistics Day0.fastq 1.0
## 2 PASS Per base sequence quality Day0.fastq 1.0
## 3 PASS Per tile sequence quality Day0.fastq 1.0
## 4 PASS Per sequence quality scores Day0.fastq 1.0
## 5 FAIL Per base sequence content Day0.fastq 0.0
## 6 WARN Per sequence GC content Day0.fastq 0.5
## 7 PASS Per base N content Day0.fastq 1.0
## 8 PASS Sequence Length Distribution Day0.fastq 1.0
## 9 PASS Sequence Duplication Levels Day0.fastq 1.0
## 10 WARN Overrepresented sequences Day0.fastq 0.5
## # ... with 62 more rows
“grade” is here acting as the key, allowing corresponding rows in the data frames to be joined together.
One important thing that all the join functions do: if multiple rows have the same key in either data frame, all ways of combining the two sets of rows will be included in the result. So, here, rows from the scoring data frame have been copied many times in the output.
Joins are the key tool for integrating different types of data, based on some common key such as gene id.
mutate lets us add or overwrite columns by computing a new value for them.
mutate(scoretab, doublescore = score*2)
## # A tibble: 72 x 5
## grade test file score doublescore
## <fctr> <chr> <chr> <dbl> <dbl>
## 1 PASS Basic Statistics Day0.fastq 1.0 2
## 2 PASS Per base sequence quality Day0.fastq 1.0 2
## 3 PASS Per tile sequence quality Day0.fastq 1.0 2
## 4 PASS Per sequence quality scores Day0.fastq 1.0 2
## 5 FAIL Per base sequence content Day0.fastq 0.0 0
## 6 WARN Per sequence GC content Day0.fastq 0.5 1
## 7 PASS Per base N content Day0.fastq 1.0 2
## 8 PASS Sequence Length Distribution Day0.fastq 1.0 2
## 9 PASS Sequence Duplication Levels Day0.fastq 1.0 2
## 10 WARN Overrepresented sequences Day0.fastq 0.5 1
## # ... with 62 more rows
This is equivalent to
scoretab2 <- scoretab
scoretab2$doublescore <- scoretab2$score * 2
summarize lets us compute summaries of data.
summarize(scoretab, total=sum(score))
## # A tibble: 1 x 1
## total
## <dbl>
## 1 60
We really want to summarize the data grouped by file, so we can see if there are any particularly bad files. This is achieved using the group_by “adjective”. group_by gives the data frame a grouping using one or more columns, which modifies the subsequent call to summarize.
group_by(scoretab, file)
## Source: local data frame [72 x 4]
## Groups: file [6]
##
## grade test file score
## <fctr> <chr> <chr> <dbl>
## 1 PASS Basic Statistics Day0.fastq 1.0
## 2 PASS Per base sequence quality Day0.fastq 1.0
## 3 PASS Per tile sequence quality Day0.fastq 1.0
## 4 PASS Per sequence quality scores Day0.fastq 1.0
## 5 FAIL Per base sequence content Day0.fastq 0.0
## 6 WARN Per sequence GC content Day0.fastq 0.5
## 7 PASS Per base N content Day0.fastq 1.0
## 8 PASS Sequence Length Distribution Day0.fastq 1.0
## 9 PASS Sequence Duplication Levels Day0.fastq 1.0
## 10 WARN Overrepresented sequences Day0.fastq 0.5
## # ... with 62 more rows
summarize(group_by(scoretab, file), total=sum(score))
## # A tibble: 6 x 2
## file total
## <chr> <dbl>
## 1 Day0.fastq 10.0
## 2 Day10.fastq 10.0
## 3 Day15.fastq 10.0
## 4 Day20.fastq 9.5
## 5 Day4.fastq 10.5
## 6 Day7.fastq 10.0
The special function n() can be used within summarize to get the number of rows. (This also works in mutate, but is most useful in summarize.)
summarize(group_by(bigtab, grade), count=n())
## # A tibble: 3 x 2
## grade count
## <fctr> <int>
## 1 FAIL 8
## 2 WARN 8
## 3 PASS 56
So summarize lets us do the things we did with table and tapply, but staying tidy by using data frames.
Tip: group_by also affects other verbs, such as mutate. This is more advanced than we will be going in to today. Grouping can be removed with ungroup.
See also functions for common uses of summarize: count, distinct.
We often want to string together a series of dplyr functions. This is achieved using dplyr’s new pipe operator, %>%. This takes the value on the left, and passes it as the first argument to the function call on the right. So the previous summarization could be written:
bigtab %>% group_by(grade) %>% summarize(count=n())
%>% isn’t limited to dplyr functions. It’s an alternative way of writing any R code.
rep("hello", 5)
## [1] "hello" "hello" "hello" "hello" "hello"
"hello" %>% rep(5)
## [1] "hello" "hello" "hello" "hello" "hello"
This section builds on what we saw of ggplot2 in the introductory R day. Recall that we could assign columns of a data from to aesthetics–x and y position, color, etc–and then add “geom”s to draw the data.
The idea of adding geoms is rather like the dplyr pipe. ggplot2 predates dplyr, and Hadley Wickham has had a progression of ideas. It will probably be possible to use %>% instead of + in the not too distant future.
With ggplot2 we don’t need to summarize bigtab, we can just view the whole data set.
ggplot(bigtab,aes(x=file,y=test,color=grade)) + geom_point()
With categorical data on the x and y axes, an even better geom to use is geom_tile.
ggplot(bigtab,aes(x=file,y=test,fill=grade)) + geom_tile()
ggplot2 offers a very wide array of ways to adjust a plot.
plot <- ggplot(bigtab,aes(x=file,y=test,fill=grade)) +
geom_tile(color="black",size=0.5) +
labs(x="",y="",fill="") + # Remove axis labels
coord_fixed() + # Square tiles
theme_minimal() + # Minimal theme, no grey background
theme(panel.grid=element_blank(), # No underlying grid lines
axis.text.x=element_text( # Vertical text on x axis
angle=90,vjust=0.5,hjust=1))
plot
We would ideally also reorder the x and y axes meaningfully, and refine the x axis labels.
Plots can be saved with ggsave. Width and height are given in inches, and an image resolution in Dots Per Inch should also be given. The width and height will determine the relative size of text and graphics, so increasing the resolution is best done by adjusting the DPI. Compare the files produced by:
ggsave("plot1.png", plot, width=5, height=5, dpi=600)
ggsave("plot2.png", plot, width=10, height=10, dpi=300)
It may be necessary to edit a plot further in a program such as Inkscape or Adobe Illustrator. To allow this, the image needs to be saved in a “vector” format such as SVG, EPS, or PDF. In this case, the DPI argument isn’t needed.
We will now look at a small gene expression data set. To simplify somewhat: The data was generated from a multiplex PCR targetting 44 genes. The PCR product is quantified by counting reads from high throughput sequencing. The experiment is of yeast released synchronously into cell cycling, so that we can see which genes are active at different points in the cycle. Several strains of yeast have been used, a wildtype and several gene knock-downs. Each has nine time points, covering two cell cycles (two hours).
This is much smaller than a typical RNA-Seq dataset. We are only looking at 44 genes, rather than all of the thousands of genes in the yeast genome.
Recall the three activities involved in data analysis:
This data set requires all of them.
Hint: You can select from the top of a pipeline to partway through and press Ctrl-Enter or Command-Enter to see the value part-way through the pipeline.
# Use readr's read_csv function to load the file
counts_untidy <- read_csv("r-more-files/read-counts.csv")
counts <- counts_untidy %>%
gather(sample, count, -Feature, factor_key=TRUE) %>%
separate(sample, sep=":", into=c("strain","time"), convert=TRUE, remove=FALSE) %>%
mutate(
strain = factor(strain, levels=c("WT","SET1","RRP6","SET1-RRP6")),
set1 = (strain == "SET1" | strain == "SET1-RRP6"),
rrp6 = (strain == "RRP6" | strain == "SET1-RRP6")
) %>%
filter(time >= 2) %>%
select(sample, strain, set1, rrp6, time, gene=Feature, count)
summary(counts)
## sample strain set1 rrp6
## WT:2 : 45 WT :405 Mode :logical Mode :logical
## WT:3 : 45 SET1 :405 FALSE:810 FALSE:810
## WT:4 : 45 RRP6 :405 TRUE :810 TRUE :810
## WT:5 : 45 SET1-RRP6:405 NA's :0 NA's :0
## WT:6 : 45
## WT:7 : 45
## (Other):1350
## time gene count
## Min. : 2 Length:1620 Min. : 0
## 1st Qu.: 4 Class :character 1st Qu.: 172
## Median : 6 Mode :character Median : 666
## Mean : 6 Mean : 3612
## 3rd Qu.: 8 3rd Qu.: 2548
## Max. :10 Max. :143592
##
Here we’ve used two tidying functions from tidyr:
gather gathers columns into a key column and a value column. It’s similar to reshape2’s melt function.
separate splits character columns by a separator string (actually, a regular expression), creating new columns.
“Time point 1” was omitted because it isn’t actually part of the time series, it’s from cells in an unsynchronized state. If we wanted to be even tidier, the remaining time points could be recoded with the actual time within the experiment – they’re at 15 minute intervals.
ggplot(counts, aes(x=sample, y=count)) + geom_boxplot() + coord_flip()
We immediately see from a Tukey box plot that the data is far from normal – there are outliers many box-widths to the right of the box. Perhaps a log transformation is in order.
ggplot(counts, aes(x=sample, y=log2(count))) + geom_boxplot() + coord_flip()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
ggplot(counts, aes(x=log2(count), group=sample)) + geom_line(stat="density")
## Warning: Removed 14 rows containing non-finite values (stat_density).
Log transformation has greatly improved things, although now there are more outliers left of the whiskers, and some zeros we will need to be careful of. We also see one of the samples has less reads than the others.
The gene SRP68 was included as a control housekeeping gene. Its expression level should be the same in all samples. We will divide counts in each sample by the count for SRP68 to correct for library size, then log transform the data. We add a small constant when log transforming so that zeros do not become negative infinity.
normalizer <- counts %>%
filter(gene == "SRP68") %>%
select(sample, norm=count)
moderation <- 0.5/mean(normalizer$norm)
counts_norm <- counts %>%
left_join(normalizer, by="sample") %>%
mutate(
norm_count = count/norm,
log_norm_count = log2(norm_count+moderation)
)
ggplot(counts_norm, aes(x=sample, y=log_norm_count)) + geom_boxplot() + coord_flip()
ggplot(counts_norm, aes(x=time, y=gene, fill=log_norm_count)) +
geom_tile() + facet_grid(~ strain) +
scale_fill_viridis() + theme_minimal()
ggplot(counts_norm, aes(x=time, y=strain, fill=log_norm_count)) +
geom_tile() + facet_wrap(~ gene) +
scale_fill_viridis() + theme_minimal()
ggplot(counts_norm, aes(x=time, y=log_norm_count, color=strain, group=strain)) +
geom_line() + facet_wrap(~ gene, scale="free")
Hint: intToUtf8(utf8ToInt(“vtf!gjmufs”)-1)
Hint: intToUtf8(utf8ToInt(“xvh#jurxsbe|/#vxppdul}h/#vg/#dqg#duudqjh”)-3)
log_norm_count.The idea of this section is to give a taste of what is possible, so that you know what topics to read up on or ask an expert for help with if this is the sort of analysis you need. It will also provide a little context for advanced forms of RNA-Seq differential expression analysis.
sut476 <- counts_norm %>% filter(gene=="SUT476")
sut476_wt <- sut476 %>% filter(strain=="WT")
ggplot(sut476_wt,aes(x=time,y=log_norm_count)) +
geom_point() +
geom_smooth(method="lm")
ggplot2 includes a handy geom for fitting a curve or line to data, but what exactly is it doing? What if we want to work with that fitted curve ourselves? ggplot2 can use various curve fitting methods, the simplest of which is a linear model with “lm”.
model <- lm(log_norm_count ~ time, data=sut476_wt)
model
##
## Call:
## lm(formula = log_norm_count ~ time, data = sut476_wt)
##
## Coefficients:
## (Intercept) time
## -5.21805 0.08813
# Note: The p-values listed by summary aren't always meaningful.
# It makes no sense to remove the intercept term but retain the time term.
# Significance testing can better be done with anova() and the glht package.
summary(model)
##
## Call:
## lm(formula = log_norm_count ~ time, data = sut476_wt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9458 -0.4717 -0.1152 0.2420 1.5235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.21805 0.66749 -7.817 0.000106 ***
## time 0.08813 0.10219 0.862 0.417017
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7915 on 7 degrees of freedom
## Multiple R-squared: 0.09605, Adjusted R-squared: -0.03309
## F-statistic: 0.7438 on 1 and 7 DF, p-value: 0.417
model.matrix(log_norm_count ~ time, data=sut476_wt)
## (Intercept) time
## 1 1 2
## 2 1 3
## 3 1 4
## 4 1 5
## 5 1 6
## 6 1 7
## 7 1 8
## 8 1 9
## 9 1 10
## attr(,"assign")
## [1] 0 1
A model can be tested against a simpler null model using anova. Is the fit sufficiently better to justify the extra complexity?
null_model <- lm(log_norm_count ~ 1, data=sut476_wt)
anova(null_model, model)
## Analysis of Variance Table
##
## Model 1: log_norm_count ~ 1
## Model 2: log_norm_count ~ time
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8 4.8518
## 2 7 4.3858 1 0.46601 0.7438 0.417
A linear trend is not supported, in comparison to the null hypothesis that expression is constant, by this data. Also useful for hypothesis testing, the multcomp package can be used to test comparisons of interest such as a set of pairwise comparisons. This is similar to testing of contrasts in packages for differential expression analysis such as limma, edgeR, or DESeq2.
augment from the broom package gives various diagnostic statistics.
augment(model, sut476_wt) %>% head
## sample strain set1 rrp6 time gene count norm norm_count
## 1 WT:2 WT FALSE FALSE 2 SUT476 8 92 0.08695652
## 2 WT:3 WT FALSE FALSE 3 SUT476 10 503 0.01988072
## 3 WT:4 WT FALSE FALSE 4 SUT476 28 989 0.02831143
## 4 WT:5 WT FALSE FALSE 5 SUT476 23 1236 0.01860841
## 5 WT:6 WT FALSE FALSE 6 SUT476 31 1122 0.02762923
## 6 WT:7 WT FALSE FALSE 7 SUT476 47 1054 0.04459203
## log_norm_count .fitted .se.fit .resid .hat .sigma
## 1 -3.518249 -5.041789 0.4865111 1.5235400 0.3777778 0.3304857
## 2 -5.629393 -4.953660 0.4044709 -0.6757333 0.2611111 0.7924442
## 3 -5.126217 -4.865530 0.3337439 -0.2606864 0.1777778 0.8468699
## 4 -5.723242 -4.777401 0.2829451 -0.9458414 0.1277778 0.7483436
## 5 -5.161007 -4.689271 0.2638477 -0.4717353 0.1111111 0.8302040
## 6 -4.476729 -4.601142 0.2829451 0.1244130 0.1277778 0.8532328
## .cooksd .std.resid
## 1 1.80748195 2.4400939
## 2 0.17427640 -0.9931416
## 3 0.01426122 -0.3632028
## 4 0.11991088 -1.2794703
## 5 0.02497354 -0.6321208
## 6 0.00207469 0.1682974
# augment() can also be used to produce predictions for new inputs
augment(model, newdata=sut476_wt[1:5,])
## sample strain set1 rrp6 time gene count norm norm_count
## 1 WT:2 WT FALSE FALSE 2 SUT476 8 92 0.08695652
## 2 WT:3 WT FALSE FALSE 3 SUT476 10 503 0.01988072
## 3 WT:4 WT FALSE FALSE 4 SUT476 28 989 0.02831143
## 4 WT:5 WT FALSE FALSE 5 SUT476 23 1236 0.01860841
## 5 WT:6 WT FALSE FALSE 6 SUT476 31 1122 0.02762923
## log_norm_count .fitted .se.fit
## 1 -3.518249 -5.041789 0.4865111
## 2 -5.629393 -4.953660 0.4044709
## 3 -5.126217 -4.865530 0.3337439
## 4 -5.723242 -4.777401 0.2829451
## 5 -5.161007 -4.689271 0.2638477
# Let's look at the model fit, and how influential each observation was
augment(model, sut476_wt) %>%
ggplot(aes(x=time, y=log_norm_count)) +
geom_point(aes(size=.cooksd), alpha=0.5) +
geom_line(aes(y=.fitted))
We see the initial timepoint in the wildtype strain is wildly off the linear trend. The large Cook’s Distance indicates this is strongly affecting the model. Possibly early time points need to be removed, or (more likely!) a linear trend is not a good model of this data.
Linear models can be much more complicated than this. They can include multiple explanatory variables, including categorical variables, and interactions between them, and even polynomials.
model2 <- lm(log_norm_count ~ strain*poly(time,3), data=sut476)
model2
##
## Call:
## lm(formula = log_norm_count ~ strain * poly(time, 3), data = sut476)
##
## Coefficients:
## (Intercept) strainSET1
## -4.6893 1.4081
## strainRRP6 strainSET1-RRP6
## 4.2133 3.5053
## poly(time, 3)1 poly(time, 3)2
## 1.3653 2.8553
## poly(time, 3)3 strainSET1:poly(time, 3)1
## -2.4314 2.0542
## strainRRP6:poly(time, 3)1 strainSET1-RRP6:poly(time, 3)1
## -2.0608 -0.5208
## strainSET1:poly(time, 3)2 strainRRP6:poly(time, 3)2
## -3.9125 -2.9297
## strainSET1-RRP6:poly(time, 3)2 strainSET1:poly(time, 3)3
## -2.1218 1.9517
## strainRRP6:poly(time, 3)3 strainSET1-RRP6:poly(time, 3)3
## 2.1368 2.2349
augment(model2, sut476) %>%
ggplot(aes(x=time, y=log_norm_count, color=strain, group=strain)) +
geom_point(aes(size=.cooksd), alpha=0.5) +
geom_line(aes(y=.fitted))
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets base
##
## other attached packages:
## [1] broom_0.4.1 viridis_0.3.4 ggplot2_2.1.0 dplyr_0.5.0 tidyr_0.5.1
## [6] readr_0.2.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.5 knitr_1.13 magrittr_1.5 mnormt_1.5-4
## [5] munsell_0.4.3 lattice_0.20-33 colorspace_1.2-6 R6_2.1.2
## [9] stringr_1.0.0 plyr_1.8.4 tools_3.3.0 parallel_3.3.0
## [13] grid_3.3.0 nlme_3.1-128 gtable_0.2.0 psych_1.6.6
## [17] DBI_0.4-1 htmltools_0.3.5 lazyeval_0.2.0 yaml_2.1.13
## [21] assertthat_0.1 digest_0.6.9 tibble_1.1 gridExtra_2.2.1
## [25] reshape2_1.4.1 formatR_1.4 evaluate_0.9 rmarkdown_1.0
## [29] labeling_0.3 stringi_1.1.1 methods_3.3.0 scales_0.4.0